Abstract

We revisit the longest common extension (LCE) problem, that is, preprocess a string T into a compact
data structure that supports fast LCE queries. An LCE query takes a pair (i, j) of indices in T and
returns the length of the longest common prefix of the suffixes of T starting at positions i and j. We study
the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case
time for answering an LCE query. Let n be the length of T. Given a parameter $\tau$, $1 \le \tau \le n$, we show
how to achieve either $O(n/\sqrt{\tau})$ space and $O(\tau)$ query time, or $O(n/\tau)$ space and
$O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time, where $|\mathrm{LCE}(i,j)|$ denotes the length of the LCE returned
by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the
previously known bounds at the extremes when $\tau = 1$ or $\tau = n$. We apply the result to obtain improved
bounds for several applications where the LCE problem is the computational bottleneck, including
approximate string matching and computing palindromes. Finally, we also present an efficient technique to
reduce LCE queries on two strings to one string.

1 Introduction 

Given a string T, the longest common extension of suffix i and j, denoted LCE(i, j), is the length of the
longest common prefix of the suffixes of T starting at positions i and j. The longest common extension
problem (LCE problem) is to preprocess T into a compact data structure supporting fast longest common
extension queries.

The LCE problem is a basic primitive that appears as a subproblem in a wide range of string matching
problems such as approximate string matching and its variations [3,7,19,21,27], computing exact or
approximate tandem repeats [11,20,23], and computing palindromes. In many of the applications, the LCE
problem is the computational bottleneck.

In this paper we study the time-space trade-offs for the LCE problem, that is, the space used by the
preprocessed data structure vs. the worst-case time used by LCE queries. We assume that the input string
is given in read-only memory and is not counted in the space complexity. There are essentially only two
time-space trade-offs known: At one extreme we can store a suffix tree combined with an efficient nearest
common ancestor (NCA) data structure [13] (other combinations of $O(n)$ space data structures for the
string can also be used to achieve this bound). This solution uses $O(n)$ space and supports LCE queries in
$O(1)$ time. At the other extreme we do not store any data structure and instead answer queries simply by
comparing characters from left to right in T. This solution uses $O(1)$ space and answers an LCE(i, j) query
in $O(|\mathrm{LCE}(i,j)|) = O(n)$ time. Using succinct data structures the space for the first solution can be
slightly improved to $O(n)$ bits (see e.g. [9]). The second solution was recently shown to be very practical [14].
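
As an illustration of the second extreme, the following is a minimal Python sketch of the scan that uses no
data structure at all. The function name `lce_naive` and the 1-based indexing convention are assumptions made
here to match the notation of the paper; the test string is the one used in Example 2.

```python
def lce_naive(t: str, i: int, j: int) -> int:
    """LCE(i, j) for 1-based positions i, j in t, by direct character comparison.

    Uses O(1) extra space and O(|LCE(i, j)|) time.
    """
    k = 0
    # Stop at the first mismatch or when either suffix runs out of characters.
    while max(i, j) + k <= len(t) and t[i + k - 1] == t[j + k - 1]:
        k += 1
    return k


T = "dbcaabcabcaabcac"
print(lce_naive(T, 4, 11))   # suffixes "aabcabc..." and "aabcac" share "aabca"
```

Every structure discussed later can be checked against this scan, since it is correct by definition.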
1.1 Our Results

We show the following main result.

Theorem 1 For a string T of length n, and any parameter $\tau$, $1 \le \tau \le n$, we can solve the longest common
extension problem

(i) in $O(n/\sqrt{\tau})$ space, $O(\tau)$ query time, and $O(n^2/\sqrt{\tau})$ preprocessing time, or

(ii) in $O(n/\tau)$ space, $O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time, and $O(n \log n)$ preprocessing time.
The preprocessing time bound is achieved with high probability, that is, with probability at least
$1 - 1/n^c$ for any constant c. All other bounds are worst case.

Hence, we provide a smooth time-space trade-off that allows several new and non-trivial bounds. For
instance, with $\tau = \sqrt{n}$ Theorem 1(i) gives a solution using $O(n^{3/4})$ space and $O(\sqrt{n})$ time. If we allow
randomisation, we can use Theorem 1(ii) to further reduce the space to $O(\sqrt{n})$ while using query time
$O(\sqrt{n} \log(|\mathrm{LCE}(i,j)|/\sqrt{n})) = O(\sqrt{n} \log n)$. Note that at both extremes of the trade-off
($\tau = 1$ or $\tau = n$) we almost match the previously known bounds.

We also give a reduction from Range Minimum Queries that shows that any data structure using $O(n/\tau)$
bits space additional to the string T must use at least $\Omega(\tau)$ time to answer an LCE query. Hence, the
time-space trade-off of Theorem 1(ii) is almost optimal.

Furthermore, we also consider LCE queries between two strings, i.e. the pair of indices to an LCE query
is from different strings. We present a general result that reduces the query on two strings to a single one of
them. When one of the strings is significantly smaller than the other, we can combine this reduction with
Theorem 1 to obtain even better time-space trade-offs.

1.2 Techniques

The high-level idea in Theorem 1 is to combine and balance out the two extreme solutions for the LCE
problem. For Theorem 1(i) we use difference covers to sample a set of suffixes of T of size $O(n/\sqrt{\tau})$. We
store a compact trie combined with an NCA data structure for this sample using $O(n/\sqrt{\tau})$ space. To answer
an LCE query we compare characters from T until we get a mismatch or reach a pair of sampled suffixes,
for which we then immediately compute the answer. By the properties of difference covers we compare at
most $O(\tau)$ characters before reaching a pair of sampled suffixes. Similar ideas have previously been used to
achieve trade-offs for suffix array and LCP array construction [16,28].

For Theorem 1(ii) we show how to use Rabin-Karp fingerprinting [17] instead of difference covers to
reduce the space further. We show how to store a sample of $O(n/\tau)$ fingerprints, and how to use it to
answer LCE queries using doubling search combined with directly comparing characters. This leads to the
output-sensitive $O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time. We reduce space compared to Theorem 1(i) by
computing fingerprints on-the-fly as we need them. Initially, we give a Monte-Carlo style randomised data
structure that may answer queries incorrectly (see Theorem 5). However, this solution uses only $O(n)$
preprocessing time and is therefore of independent interest in applications that can tolerate errors. To get
the error-free Las-Vegas style bound of Theorem 1(ii) we need to verify that the fingerprints we compute are
collision free; i.e. two fingerprints are equal iff the corresponding substrings of T are equal. The main
challenge is to do this in only $O(n \log n)$ time. We achieve this by showing how to efficiently verify
fingerprints of composed samples which we have already verified, and by developing a search strategy that
reduces the fingerprints we need to consider.

Finally, the reduction for LCE on two strings to a single string is based on a simple and compact
encoding of the larger string using the smaller string. The encoding could be of independent interest in
related problems, where we want to take advantage of different length input strings.

1.3 Applications

With Theorem 1 we immediately obtain new results for problems based on LCE queries. We review some
of the most important below.

1.3.1 Approximate String Matching 

Given strings P and T and an error threshold k, the approximate string matching problem is to report all
ending positions of substrings of T whose edit distance to P is at most k. The edit distance between two
strings is the minimum number of insertions, deletions, and substitutions needed to convert one string to the
other. Let m and n be the lengths of P and T. The Landau-Vishkin algorithm [21] solves approximate string
matching using $O(nk)$ LCE queries on P and substrings of T of length $O(m)$. Using the standard linear
space and constant time LCE data structure, this leads to a solution using $O(nk)$ time and $O(m)$ space (the
$O(m)$ space bound follows by the standard trick of splitting T into overlapping pieces of size $O(m)$). If we
plug in the results from Theorem 1 we immediately obtain the following result.

Theorem 2 Given strings P and T of lengths m and n, respectively, and a parameter $\tau$, $1 \le \tau \le m$, we
can solve approximate string matching

(i) in $O(m/\sqrt{\tau})$ space and $O(nk \cdot \tau + nm/\sqrt{\tau})$ time, or

(ii) in $O(m/\tau)$ space and $O(nk \cdot \tau \log m)$ time with high probability.

To the best of our knowledge these are the first non-trivial bounds for approximate string matching using
$o(m)$ space.

1.3.2 Palindromes 

Given a string T the palindrome problem is to report the set of all maximal palindromes in T. A substring
$T[i \ldots j]$ is a maximal palindrome iff $T[i \ldots j] = T[i \ldots j]^R$ and $T[i-1 \ldots j+1] \ne T[i-1 \ldots j+1]^R$.
Here $T[i \ldots j]^R$ denotes the reverse of $T[i \ldots j]$. Any palindrome in T occurs in the middle of a maximal
palindrome, and thus the set of maximal palindromes compactly represents all palindromes in T. The
palindrome problem appears in a diverse range of applications, see e.g. [2,4,10,15,18,22,25].

We can trivially solve the problem in $O(n^2)$ time and $O(1)$ space by a linear search at each position in
T to find the maximal palindrome. With LCE queries we can immediately speed up this search. Using the
standard $O(n)$ space and constant time solution to the LCE problem this immediately implies an algorithm
for the palindrome problem that uses $O(n)$ time and space (this bound can also be achieved without LCE
queries [24]). Using Theorem 1 we immediately obtain the following result.

Theorem 3 Given a string of length n and a parameter $\tau$, $1 \le \tau \le n$, we can solve the palindrome problem

(i) in $O(n/\sqrt{\tau})$ space and $O(n^2/\sqrt{\tau} + n\tau)$ time, or

(ii) in $O(n/\tau)$ space and $O(n \cdot \tau \log n)$ time with high probability.

For $\tau = \omega(1)$, these are the first sublinear space bounds using $o(n^2)$ time. Similarly, we can substitute our
LCP data structures in the LCP-based variants of palindrome problems such as complemented palindromes,
approximate palindromes, or gapped palindromes, see e.g. [18].
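
To make the use of LCE queries concrete, here is a minimal Python sketch that finds the maximal palindrome
at every centre with one LCE query per centre. The text does not spell out the construction, so the approach
shown (querying the string T followed by a separator and the reverse of T) is an assumption; the separator
`"\x00"` is assumed not to occur in T, and `lce_naive` is a stand-in for any LCE data structure.

```python
def lce_naive(s, i, j):
    """0-based LCE on s; a stand-in for any LCE data structure."""
    k = 0
    while max(i, j) + k < len(s) and s[i + k] == s[j + k]:
        k += 1
    return k


def maximal_palindromes(t, lce=lce_naive, sep="\x00"):
    """Return the maximal palindromes of t as 1-based inclusive pairs (i, j)."""
    n = len(t)
    s = t + sep + t[::-1]          # t[c] sits at index c; its reversed copy at 2n - c
    found = []
    for c in range(n):
        # Odd length, centred on t[c]: extend right from c and left from c.
        r = lce(s, c, 2 * n - c)
        found.append((c - r + 2, c + r))
        if c > 0:
            # Even length, centred between t[c-1] and t[c].
            r = lce(s, c, 2 * n + 1 - c)
            if r > 0:
                found.append((c - r + 1, c + r))
    return found


print(sorted(maximal_palindromes("abaab")))
```

With a constant-time LCE structure the loop performs $O(n)$ queries; replacing `lce` by a slower, smaller
structure trades time for space in the way Theorem 3 describes.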

1.3.3 Tandem Repeats 

Given a string T, the tandem repeats problem is to report all squares, i.e. consecutive repeated substrings in
T. Main and Lorentz [23] gave a simple solution for this problem based on LCP queries that achieves $O(n)$
space and $O(n \log n + occ)$ time, where occ is the number of tandem repeats in T. Using different techniques
Gasieniecs et al. [12] gave a solution using $O(1)$ space and $O(n \log n + occ)$ time. Using Theorem 1 we
immediately obtain the following result.

Theorem 4 Given a string of length n and a parameter $\tau$, $1 \le \tau \le n$, we can solve the tandem repeats
problem

(i) in $O(n/\sqrt{\tau})$ space and $O(n^2/\sqrt{\tau} + n\tau \cdot \log n + occ)$ time, or

(ii) in $O(n/\tau)$ space and $O(n\tau \cdot \log^2 n + occ)$ time with high probability.

While this does not improve the result by Gasieniecs et al. it provides a simple LCP-based solution.
Furthermore, our result generalizes to the approximate versions of the tandem repeats problem, which also have
solutions based on LCP queries [20].

2 The Deterministic Data Structure 

We now show Theorem 1(i). Our deterministic time-space trade-off is based on sampling suffixes using
difference covers.

2.1 Difference Covers 

A difference cover modulo $\tau$ is a set of integers $D \subseteq \{0, 1, \ldots, \tau - 1\}$ such that for any distance
$d \in \{0, 1, \ldots, \tau - 1\}$, D contains two elements separated by distance d modulo $\tau$ (see Example 1).

Example 1 The set D = {1,2,4} is a difference cover modulo 5.

[Table: for each distance d, a pair of elements of D separated by distance d modulo 5, for instance the pair 1, 4.]

A difference cover D can cover at most $|D|^2$ differences, and hence D must have size at least $\sqrt{\tau}$. We can
also efficiently compute a difference cover within a constant factor of this bound.

Lemma 1 (Colbourn and Ling [6]) For any $\tau$, a difference cover modulo $\tau$ of size at most $\sqrt{1.5\tau} + 6$
can be computed in $O(\sqrt{\tau})$ time.
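
The definition can be checked mechanically. The following Python sketch verifies Example 1 and also builds a
simple cover of size at most about $2\sqrt{\tau}$. Note that `simple_difference_cover` is not the construction of
Colbourn and Ling from Lemma 1; it is a cruder, easily verified construction assumed here purely for
illustration, and both function names are hypothetical.

```python
from math import isqrt


def is_difference_cover(d, tau):
    """True iff every distance 0..tau-1 occurs as (x - y) mod tau with x, y in d."""
    diffs = {(x - y) % tau for x in d for y in d}
    return diffs == set(range(tau))


def simple_difference_cover(tau):
    """A cover of size at most 2*ceil(sqrt(tau)); NOT the Colbourn-Ling construction.

    With r = ceil(sqrt(tau)), every distance 1 <= dist < tau can be written as
    a*r - b with 1 <= a <= r and 0 <= b < r, so the set
    {0, ..., r-1} together with {r, 2r, ..., r*r} reduced modulo tau suffices.
    """
    r = isqrt(tau - 1) + 1                      # ceil(sqrt(tau)) for tau >= 1
    small = {b % tau for b in range(r)}
    large = {(a * r) % tau for a in range(1, r + 1)}
    return small | large


print(is_difference_cover({1, 2, 4}, 5))        # the set from Example 1
print(sorted(simple_difference_cover(16)))
```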



2.2 The Data Structure 

Let T be a string of length n and let $\tau$, $1 \le \tau \le n$, be a parameter. Our data structure consists of the
compact trie of a sampled set of suffixes from T combined with an NCA data structure. The sampled set of
suffixes S is the set of suffixes obtained by overlaying a difference cover on T with period $\tau$, that is,

$S = \{\, i \mid 1 \le i \le n \wedge i \bmod \tau \in D \,\}$.



Example 2 Consider the string T = dbcaabcabcaabcac. As shown below, using the difference cover from
Example 1, we obtain the suffix sample S = {1, 2, 4, 6, 7, 9, 11, 12, 14, 16}. The sampled positions are marked
with a bullet; the difference cover D is repeated with period 5.

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    T:  d  b  c  a  a  b  c  a  b  c  a  a  b  c  a  c
    S:  •  •     •     •  •     •     •  •     •     •
By Lemma 1 the size of S is $O(n/\sqrt{\tau})$. Hence the compact trie and the NCA data structures use $O(n/\sqrt{\tau})$
space. We construct the data structure in $O(n^2/\sqrt{\tau})$ time by inserting each of the $O(n/\sqrt{\tau})$ sampled
suffixes in $O(n)$ time.

To answer an LCE(i, j) query we explicitly compare characters starting from i and j until we either get
a mismatch or we encounter a pair of sampled suffixes. If we get a mismatch we simply report the length of
the LCE. Otherwise, we do an NCA query on the sampled suffixes to compute the LCE. Since the distance
to a pair of sampled suffixes is at most $\tau$ the total time to answer a query is $O(\tau)$. This concludes the proof
of Theorem 1(i).
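
The query procedure can be sketched in a few lines of Python. This is only an illustration under stated
assumptions: the compact trie and NCA structure are replaced by a stand-in oracle (`lce_naive`, which is not
constant time), and the function names are hypothetical. The sketch returns both the answer and the number
of characters scanned before a sampled pair was reached, so the bound of fewer than $\tau$ scanned characters
can be observed directly.

```python
def sample_positions(n, tau, d):
    """S = { i : 1 <= i <= n and i mod tau in D } (1-based positions)."""
    return {i for i in range(1, n + 1) if i % tau in d}


def lce_naive(t, i, j):
    """1-based LCE by direct comparison; stands in for the trie + NCA query."""
    k = 0
    while max(i, j) + k <= len(t) and t[i + k - 1] == t[j + k - 1]:
        k += 1
    return k


def lce_sampled(t, i, j, sample, oracle=lce_naive):
    """Return (LCE(i, j), number of characters scanned before the oracle call)."""
    k = 0
    while max(i, j) + k <= len(t):
        if i + k in sample and j + k in sample:
            # Both suffixes are sampled: the stored structure answers the rest.
            return k + oracle(t, i + k, j + k), k
        if t[i + k - 1] != t[j + k - 1]:
            return k, k
        k += 1
    return k, k


T = "dbcaabcabcaabcac"
S = sample_positions(len(T), 5, {1, 2, 4})
print(sorted(S))
print(lce_sampled(T, 3, 10, S))
```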

3 The Monte-Carlo Data Structure

We now show Theorem 5 below which is an intermediate step towards proving Theorem 1(ii) but is also
of independent interest, providing a Monte-Carlo time-space trade-off. The technique is based on sampling
suffixes using Rabin-Karp fingerprints. These fingerprints will be used to speed up queries with large LCE
values while queries with small LCE values will be handled naively.

Theorem 5 For a string T of length n, and any parameter $\tau$, $1 \le \tau \le n$, we can solve the LCE problem in
$O(n/\tau)$ space, $O(\tau \log(|\mathrm{LCE}(i,j)|/\tau))$ query time, and $O(n)$ preprocessing. The solution is randomised
(Monte-Carlo); with high probability, all queries are answered correctly.

3.1 Rabin-Karp fingerprints 

Rabin-Karp fingerprints are defined as follows. Let $2n^{c+4} < p < 4n^{c+4}$ be some prime and choose
$b \in \mathbb{Z}_p$ uniformly at random. Let S be any substring of T; the fingerprint $\phi(S)$ is given by

$\phi(S) = \sum_{k=1}^{|S|} S[k]\, b^k \bmod p$.

Lemma 2 gives a crucial property of these fingerprints (see e.g. [17] for a proof). That is, with high probability
we can determine whether any two substrings of T match in constant time by comparing their fingerprints.

Lemma 2 Let $\phi$ be a fingerprinting function picked uniformly at random (as described above). With high
probability,

$\phi(T[i \ldots i+\alpha-1]) = \phi(T[j \ldots j+\alpha-1])$
iff $T[i \ldots i+\alpha-1] = T[j \ldots j+\alpha-1]$ for all $i, j, \alpha$. (1)

3.2 The Data Structure 

The data structure consists of the fingerprint $\phi_k$ of each suffix of the form $T[k\tau \ldots n]$ for $0 < k < n/\tau$,
i.e. $\phi_k = \phi(T[k\tau \ldots n])$. Note that there are $O(n/\tau)$ such suffixes and the starting points of two consecutive
suffixes are $\tau$ characters apart. Since each fingerprint uses constant space the space usage of the data structure
is $O(n/\tau)$. The $n/\tau$ fingerprints can be computed in left-to-right order by a single scan of T in $O(n)$ time.

Queries The key property we use to answer a query is given by Lemma 3.

Lemma 3 The fingerprint $\phi(T[i \ldots i+\alpha-1])$ of any substring $T[i \ldots i+\alpha-1]$ can be constructed in $O(\tau)$
time. If $i, \alpha$ are divisible by $\tau$, the time becomes $O(1)$.

Proof. Let $k_1 = \lceil i/\tau \rceil$ and $k_2 = \lceil (i+\alpha)/\tau \rceil$ and observe that we have $\phi_{k_1}$ and $\phi_{k_2}$ stored. By the
definition of $\phi$, we can compute $\phi(T[k_1\tau \ldots k_2\tau - 1]) = \phi_{k_1} - \phi_{k_2} \cdot b^{(k_2 - k_1)\tau} \bmod p$ in $O(1)$ time. If
$i, \alpha \equiv 0 \bmod \tau$ then $k_1\tau = i$ and $k_2\tau = i + \alpha$ and we are done. Otherwise, similarly we can then convert
$\phi(T[k_1\tau \ldots k_2\tau - 1])$ into $\phi(T[k_1\tau - 1 \ldots k_2\tau - 2])$ in $O(1)$ time by inspecting $T[k_1\tau - 1]$ and $T[k_2\tau - 1]$.
By repeating this final step we obtain $\phi(T[i \ldots i+\alpha-1])$ in $O(\tau)$ time. □
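
The block-aligned case of Lemma 3 can be illustrated with a short Python sketch. Several details are
assumptions made for the example rather than taken from the text: characters are mapped to integers with
`ord`, the suffix fingerprints are accumulated by a right-to-left Horner-style scan (the text scans left to
right), the power of b is recomputed with `pow` instead of being available in constant time, and an empty
suffix is given fingerprint 0. All function names are hypothetical.

```python
def fingerprint(s, b, p):
    """phi(S) = sum_{k=1..|S|} S[k] * b^k mod p, computed directly from the definition."""
    return sum(ord(c) * pow(b, k, p) for k, c in enumerate(s, start=1)) % p


def suffix_fingerprints(t, tau, b, p):
    """phi_k = phi(T[k*tau .. n]) for every k with k*tau <= n (1-based positions)."""
    phis, acc = {}, 0
    for pos in range(len(t), 0, -1):
        # Prepending a character c to S gives phi(cS) = b * (c + phi(S)).
        acc = (ord(t[pos - 1]) + acc) * b % p
        if pos % tau == 0:
            phis[pos // tau] = acc
    return phis


def aligned_fingerprint(phis, k1, k2, tau, b, p):
    """phi(T[k1*tau .. k2*tau - 1]) = phi_k1 - phi_k2 * b^((k2 - k1) * tau) mod p."""
    tail = phis.get(k2, 0)           # assumed 0 when the suffix at k2*tau is empty
    return (phis[k1] - tail * pow(b, (k2 - k1) * tau, p)) % p


P_MOD = 2**61 - 1                    # a fixed prime, chosen here only for the demo
B = 123456789                        # in the data structure b is drawn at random
T = "dbcaabcabcaabcac"
PHIS = suffix_fingerprints(T, 4, B, P_MOD)
print(aligned_fingerprint(PHIS, 1, 3, 4, B, P_MOD) == fingerprint(T[3:11], B, P_MOD))
```

Only the values in `phis` are stored, which is one fingerprint per $\tau$ characters of T.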

We now describe how to perform a query by using fingerprints to compare substrings. We define
$\phi_k^{\ell} = \phi(T[k\tau \ldots (k + 2^{\ell})\tau - 1])$ which we can compute in $O(1)$ time for any $k, \ell$ by Lemma 3.

First consider the problem of answering a query of the form $\mathrm{LCE}(i\tau, j\tau)$. Find the largest $\ell$ such that
$\phi_i^{\ell} = \phi_j^{\ell}$. When the correct $\ell$ is found convert the query into a new query
$\mathrm{LCE}((i + 2^{\ell})\tau, (j + 2^{\ell})\tau)$ and repeat. If no such $\ell$ exists, explicitly compare
$T[i\tau \ldots (i+1)\tau - 1]$ and $T[j\tau \ldots (j+1)\tau - 1]$ one character at a time until a mismatch is found. Since no
false negatives can occur when comparing fingerprints, such a mismatch exists. Let $\ell_0$ be the value of $\ell$
obtained for the initial query, $\mathrm{LCE}(i\tau, j\tau)$, and $\ell_q$ the value obtained during the q-th recursion. For the
initial query, we search for $\ell_0$ in increasing order, starting with $\ell_0 = 0$. After recursing, we search for $\ell_q$
in descending order, starting with $\ell_{q-1}$. By the maximality of $\ell_{q-1}$, we find the correct $\ell_q$. Summing over
all recursions we have $O(\ell_0)$ total searching time and $O(\tau)$ time scanning T. The desired query time follows
from observing that by the maximality of $\ell_0$, we have that $O(\tau + \ell_0) = O(\tau + \log(|\mathrm{LCE}|/\tau))$.

Now consider the problem of answering a query of the form $\mathrm{LCE}(i\tau, j\tau + \gamma)$ where $0 < \gamma < \tau$. By Lemma 3
we can obtain the fingerprint of any substring in $O(\tau)$ time. This allows us to use a similar approach to the
first case. We find the largest $\ell$ such that $\phi(T[j\tau + \gamma \ldots (j + 2^{\ell})\tau + \gamma - 1]) = \phi_i^{\ell}$ and convert the
current query into a new query, $\mathrm{LCE}((i + 2^{\ell})\tau, (j + 2^{\ell})\tau + \gamma)$. As we have to apply Lemma 3 before every
comparison, we obtain a total complexity of $O(\tau \log(|\mathrm{LCE}|/\tau))$.

The general case can be reduced to one of the first two cases by scanning $O(\tau)$ characters in T. By
Lemma 2, all fingerprint comparisons are correct with high probability and the result follows.
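
The first case, a block-aligned query $\mathrm{LCE}(i\tau, j\tau)$, can be sketched in Python as follows. This is a
simplified illustration and not a full implementation of the data structure: only the aligned case is
handled, spans that would run past the end of T are treated as mismatches, powers of b are recomputed with
`pow`, and the prime and base are fixed for reproducibility instead of being chosen as in Section 3.1. All
names are hypothetical.

```python
def suffix_fingerprints(t, tau, b, p):
    """phi_k = phi(T[k*tau .. n]) for every k with k*tau <= n (1-based positions)."""
    phis, acc = {}, 0
    for pos in range(len(t), 0, -1):
        acc = (ord(t[pos - 1]) + acc) * b % p
        if pos % tau == 0:
            phis[pos // tau] = acc
    return phis


def lce_naive(t, i, j):
    """Reference answer by direct comparison (1-based positions)."""
    k = 0
    while max(i, j) + k <= len(t) and t[i + k - 1] == t[j + k - 1]:
        k += 1
    return k


def lce_block_aligned(t, i, j, tau, phis, b, p):
    """Monte-Carlo LCE(i*tau, j*tau) using doubling search over fingerprints."""
    n = len(t)

    def span_equal(x, y, ell):
        """Compare fingerprints of the 2^ell blocks starting at blocks x and y."""
        w = 1 << ell
        if (max(x, y) + w) * tau > n + 1:        # span would run past the end of T
            return False
        shift = pow(b, w * tau, p)
        fx = (phis[x] - phis.get(x + w, 0) * shift) % p
        fy = (phis[y] - phis.get(y + w, 0) * shift) % p
        return fx == fy

    blocks = 0
    ell = 0
    while span_equal(i, j, ell):                 # ascending search for ell_0
        ell += 1
    ell -= 1                                     # largest matching ell, or -1
    while ell >= 0:
        i, j, blocks = i + (1 << ell), j + (1 << ell), blocks + (1 << ell)
        while ell >= 0 and not span_equal(i, j, ell):   # descending search
            ell -= 1
    # No block matches any more: finish with fewer than tau character comparisons.
    a, c, k = i * tau, j * tau, 0
    while max(a, c) + k <= n and t[a + k - 1] == t[c + k - 1]:
        k += 1
    return blocks * tau + k


P_MOD, B, TAU = 2**61 - 1, 123456789, 4
T = "ab" * 20 + "c" + "ab" * 5
PHIS = suffix_fingerprints(T, TAU, B, P_MOD)
print(lce_block_aligned(T, 1, 2, TAU, PHIS, B, P_MOD), lce_naive(T, 4, 8))
```

Because the exponent $\ell$ never increases after the first phase, the number of fingerprint comparisons is
proportional to $\ell_0$ plus the number of jumps, which matches the accounting in the text.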

4 The Las-Vegas Data Structure

We now show Theorem 1(ii). The important observation is that when we compare the fingerprints of two
strings during a query in Section 3, one of them is of the form $T[j\tau \ldots j\tau + \tau \cdot 2^{\ell} - 1]$ for some $\ell, j$.
Consequently, to ensure all queries are correctly computed, it suffices that $\phi$ is $\tau$-good:

Definition 1 A fingerprinting function $\phi$ is $\tau$-good on T iff

$\phi(T[j\tau \ldots j\tau + \tau \cdot 2^{\ell} - 1]) = \phi(T[i \ldots i + \tau \cdot 2^{\ell} - 1])$
iff $T[j\tau \ldots j\tau + \tau \cdot 2^{\ell} - 1] = T[i \ldots i + \tau \cdot 2^{\ell} - 1]$ for all $(i, j, \ell)$. (2)

In this section we give an algorithm which decides whether a given $\phi$ is $\tau$-good on string T. The algorithm
uses $O(n/\tau)$ space and takes $O(n \log n)$ time with high probability. By Lemma 2, a uniformly chosen $\phi$ is
$\tau$-good with high probability and therefore (by repetition) we can generate such a $\phi$ in the same time/space
bounds. For brevity we assume that n and $\tau$ are powers-of-two.

4.1 The Algorithm 

We begin by giving a brief overview of Algorithm 1. For each value of $\ell$ in ascending order (the outermost
loop), Algorithm 1 checks (2) for all i, j. For some outermost loop iteration $\ell$, the algorithm inserts the
fingerprint of each block-aligned substring into a dynamic perfect dictionary, $D_{\ell}$ (lines 3-9). A substring is
block-aligned if it is of the form $T[j\tau \ldots (j + 2^{\ell})\tau - 1]$ for some j (and block-unaligned otherwise). If more
than one block-aligned substring has the same fingerprint, we insert only the left-most as a representative.
For the first iteration, $\ell = 0$, we also build an Aho-Corasick automaton [1], denoted AC, with a pattern
dictionary containing every block-aligned substring.

The second stage (lines 12-21) uses a sliding window technique, checking each time we slide whether
the fingerprint of the current $(2^{\ell}\tau)$-length substring occurs in the dynamic dictionary, $D_{\ell}$. If so we check
whether the corresponding substrings match (if not a collision has been revealed and we abort). For $\ell > 0$,
we use the fact that (2) holds for all i, j with $\ell - 1$ (otherwise, Algorithm 1 would have already aborted) to
perform the check in constant time (line 18). I.e. if there is a collision it will be revealed by comparing the
fingerprints of the left-half ($L'_i \ne L_k$) or right-half ($R'_i \ne R_k$) of the underlying strings. For $\ell = 0$, the check
is performed using the AC automaton (lines 20-21). We achieve this by feeding T one character at a time
into the AC. By inspecting the state of the AC we can decide whether the current $\tau$-length substring of T
matches any block-aligned substring.



Algorithm 1 Verifying a fingerprinting function, $\phi$ on string T
 1: // AC is an Aho-Corasick automaton and each $D_{\ell}$ is a dynamic dictionary
 2: for $\ell = 0 \ldots \log_2(n/\tau)$ do
 3:     // Insert all distinct block-aligned substring fingerprints into $D_{\ell}$
 4:     for $j = 1 \ldots n/\tau - 2^{\ell}$ do
 5:         $f_j \leftarrow \phi(T[j\tau \ldots (j + 2^{\ell})\tau - 1])$
 6:         $L_j \leftarrow \phi(T[j\tau \ldots (j + 2^{\ell-1})\tau - 1])$, $R_j \leftarrow \phi(T[(j + 2^{\ell-1})\tau \ldots (j + 2^{\ell})\tau - 1])$
 7:         if $\nexists (f_k, L_k, R_k, k) \in D_{\ell}$ such that $f_j = f_k$ then
 8:             Insert $(f_j, L_j, R_j, j)$ into $D_{\ell}$ indexed by $f_j$
 9:             if $\ell = 0$ then Insert $T[j\tau \ldots (j+1)\tau - 1]$ into AC dictionary
10:     // Check for collisions between any block-aligned and unaligned substrings
11:     if $\ell = 0$ then Feed $T[1 \ldots \tau - 1]$ into AC
12:     for $i = 1 \ldots n - \tau \cdot 2^{\ell} + 1$ do
13:         $f'_i \leftarrow \phi(T[i \ldots i + \tau \cdot 2^{\ell} - 1])$
14:         $L'_i \leftarrow \phi(T[i \ldots i + \tau \cdot 2^{\ell-1} - 1])$, $R'_i \leftarrow \phi(T[i + \tau \cdot 2^{\ell-1} \ldots i + \tau \cdot 2^{\ell} - 1])$
15:         if $\ell = 0$ then Feed $T[i + \tau - 1]$ into AC // AC now points at $T[i \ldots i + \tau - 1]$
16:         if $\exists (f_k, L_k, R_k, k) \in D_{\ell}$ such that $f'_i = f_k$ then
17:             if $\ell > 0$ then
18:                 if ($L'_i \ne L_k$ or $R'_i \ne R_k$) then abort
19:             else
20:                 Compare $T[i \ldots i + \tau - 1]$ to $T[k\tau \ldots (k+1)\tau - 1]$ by inspecting AC
21:                 if $T[i \ldots i + \tau - 1] \ne T[k\tau \ldots (k+1)\tau - 1]$ then abort



Correctness We first consider all points at which Algorithm 1 may abort. First observe that if line 21
causes an abort then (2) is violated for (i, k, 0). Second, if line 18 causes an abort either $L'_i \ne L_k$ or
$R'_i \ne R_k$. By the definition of $\phi$, in either case, this implies that
$T[i \ldots i + \tau \cdot 2^{\ell} - 1] \ne T[k\tau \ldots k\tau + 2^{\ell}\tau - 1]$. By line 16, we have that $f'_i = f_k$ and therefore (2) is
violated for $(i, k, \ell)$. Thus, Algorithm 1 does not abort if $\phi$ is $\tau$-good.

It remains to show that Algorithm 1 always aborts if $\phi$ is not $\tau$-good. Consider the total ordering on
triples $(i, j, \ell)$ obtained by stably sorting (non-decreasing) by $\ell$ then j then i. E.g. (1,3,1) < (3,2,3) <
(2,5,3) < (4,5,3). Let $(i^*, j^*, \ell^*)$ be the (unique) smallest triple under this order which violates (2). We
first argue that $(f_{j^*}, L_{j^*}, R_{j^*}, j^*)$ will be inserted into $D_{\ell^*}$ (and AC if $\ell^* = 0$). For a contradiction assume
that when Algorithm 1 checks for $f_{j^*}$ in $D_{\ell^*}$ (line 7, with $j = j^*$, $\ell = \ell^*$) we find that some $f_k = f_{j^*}$ already
exists in $D_{\ell^*}$, implying that $k < j^*$. If $T[j^*\tau \ldots j^*\tau + \tau 2^{\ell} - 1] \ne T[k\tau \ldots k\tau + \tau 2^{\ell} - 1]$ then
$(j^*\tau, k, \ell^*)$ violates (2). Otherwise, $(i^*, k, \ell^*)$ violates (2). In either case this contradicts the minimality
of $(i^*, j^*, \ell^*)$ under the given order.

We now consider iteration $i = i^*$ of the second inner loop (when $\ell = \ell^*$). We have shown that
$(f_{j^*}, L_{j^*}, R_{j^*}, j^*) \in D_{\ell^*}$ and we have that $f'_{i^*} = f_{j^*}$ (so $k = j^*$) but the underlying strings are not equal.
If $\ell = 0$ then we also have that $T[j^*\tau \ldots (j^* + 1)\tau - 1]$ is in the AC dictionary. Therefore inspecting the
current AC state will cause an abort (lines 20-21). If $\ell > 0$ then as $(i^*, j^*, \ell^*)$ is minimal, either
$L'_{i^*} \ne L_{j^*}$ or $R'_{i^*} \ne R_{j^*}$ which again causes an abort (line 18), concluding the correctness.

Time-Space Complexity We begin by upper bounding the space used and the time taken to perform
all dictionary operations on $D_{\ell}$ for any $\ell$. First observe that there are at most $O(n/\tau)$ insertions (line 8) and
at most $O(n)$ look-up operations (lines 7,16). We choose the dictionary data structure employed based on
the relationship between n and $\tau$. If $\tau > \sqrt{n}$ then we use the deterministic dynamic dictionary of Ruzic [29].
Using the correct choice of constants, this dictionary supports look-ups and insert operations in $O(1)$ and
$O(\sqrt{n})$ time respectively (and linear space). As there are only $O(n/\tau) = O(\sqrt{n})$ inserts, the total time taken
is $O(n)$ and the space used is $O(n/\tau)$. If $\tau \le \sqrt{n}$ we use the Las-Vegas dynamic dictionary of Dietzfelbinger
and Meyer auf der Heide [8]. If $\Omega(\sqrt{n}) = O(n/\tau)$ space is used for $D_{\ell}$, as we perform $O(n)$ operations,
every operation takes $O(1)$ time with high probability. In either case, over all $\ell$ we take $O(n \log n)$ total time
processing dictionary operations.

The operations performed on AC fall conceptually into three categories, each totalling $O(n \log n)$ time.
First we insert $O(n/\tau)$ $\tau$-length substrings into the AC dictionary (line 9). Second, we feed T into the
automaton (lines 11,15) and third, we inspect the AC state at most n times (line 20). We store AC in the
standard compacted form with edge labels stored as pointers into T. This means that the space used is
$O(n/\tau)$, the number of strings in the AC dictionary.

Finally we bound the time spent constructing fingerprints. We first consider computing $f'_i$ (line 13) for
$i > 1$. We can compute $f'_i$ in $O(1)$ time from $f'_{i-1}$, $T[i-1]$ and $T[i + \tau \cdot 2^{\ell} - 1]$. This follows immediately
from the definition of $\phi$. We can compute $L'_i$ and $R'_i$ analogously. Over all $i, \ell$, this gives $O(n \log n)$ time.
Similarly we can compute $f_j$ from $f_{j-1}$, $T[(j-1)\tau \ldots j\tau - 1]$ and $T[(j - 1 + 2^{\ell})\tau \ldots (j + 2^{\ell})\tau - 1]$ in $O(\tau)$
time. Again this is analogous for $L_j$ and $R_j$. Summing over all $j, \ell$ this gives $O(n \log n)$ time again. Finally
observe that the algorithm only needs to store the current and previous values for each fingerprint so this
does not dominate the space usage.
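
The constant-time window update can be written out explicitly. The sketch below assumes the fingerprint
definition of Section 3.1 with characters mapped to integers by `ord`, a base b that is non-zero modulo p, and
a window length w that is at most the length of the string; the function names are hypothetical. It collects
all window fingerprints in a list for easy checking, whereas the algorithm only keeps the current one.

```python
def fingerprint(s, b, p):
    """phi(S) = sum_{k=1..|S|} S[k] * b^k mod p, computed directly from the definition."""
    return sum(ord(c) * pow(b, k, p) for k, c in enumerate(s, start=1)) % p


def sliding_fingerprints(t, w, b, p):
    """Fingerprints of all length-w windows of t, each obtained from the previous one."""
    b_inv = pow(b, -1, p)            # modular inverse of b (Python 3.8+, b != 0 mod p)
    b_w = pow(b, w, p)
    f = fingerprint(t[:w], b, p)     # the first window is computed directly
    out = [f]
    for i in range(1, len(t) - w + 1):
        # Remove the outgoing character (weight b^1), lower every remaining weight
        # by one power of b, then add the incoming character with weight b^w.
        f = ((f - ord(t[i - 1]) * b) * b_inv + ord(t[i + w - 1]) * b_w) % p
        out.append(f)
    return out


P_MOD, B = 2**61 - 1, 123456789
T = "dbcaabcabcaabcac"
windows = sliding_fingerprints(T, 4, B, P_MOD)
print(len(windows), windows[0] == fingerprint(T[:4], B, P_MOD))
```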

5 Longest Common Extensions on Two Strings 

We now show how to efficiently reduce LCE queries between two strings to LCE queries on a single string.
We generalize our notation as follows. Let P and T be strings of lengths m and n, respectively. Define
$\mathrm{LCE}_{P,T}(i, j)$ to be the length of the longest common prefix of the substrings of P and T starting at i and
j, respectively. For a single string P, we define $\mathrm{LCE}_P(i, j)$ as usual. We can always trivially solve the LCE
problem for P and T by solving it for the string obtained by concatenating P and T. We show the following
improved result.

Theorem 6 Let P and T be strings of lengths m and n, respectively. Given a solution to the LCE problem
on P using s(m) space and q(m) time and a parameter $\tau$, $1 \le \tau \le n$, we can solve the LCE problem on P
and T using $O(n/\tau + s(m))$ space and $O(\tau + q(m))$ time.

For instance, plugging in Theorem 1(i) in Theorem 6 we obtain a solution using $O(n/\tau + m/\sqrt{\tau})$ space and
$O(\tau)$ time. Compared with the direct solution on the concatenated string that uses $O((n+m)/\sqrt{\tau})$ space we
save substantial space when $m \ll n$.

5.1 The Data Structure 

The basic idea for our data structure is inspired by a trick for reducing constant factors in the space for the
LCE data structures [10, Ch. 9.1.2]. We show how to extend it to obtain asymptotic improvements. Let P
and T be strings of lengths m and n, respectively. Our data structure for LCE queries on P and T consists
of the following information.

• A data structure that supports LCE queries for P using s(m) space and q(m) query time.

• An array A of length $\lfloor n/\tau \rfloor$ such that A[i] is the maximum LCE value between any suffix of P and the
suffix of T starting at position $i \cdot \tau$, that is, $A[i] = \max_{j=1 \ldots m} \mathrm{LCE}_{P,T}(j, i\tau)$.

• An array B of length $\lfloor n/\tau \rfloor$ such that B[i] is the index in P of a suffix that maximizes the LCE value,
that is, $B[i] = \arg\max_{j=1 \ldots m} \mathrm{LCE}_{P,T}(j, i\tau)$.

Arrays A and B use $O(n/\tau)$ space and hence the total space is $O(n/\tau + s(m))$. We answer an $\mathrm{LCE}_{P,T}$ query
as follows. Suppose that $\mathrm{LCE}_{P,T}(i, j) < \tau$. In that case we can determine the value of $\mathrm{LCE}_{P,T}(i, j)$ in $O(\tau)$
time by explicitly comparing the characters from position i in P and j in T until we encounter the mismatch.
If $\mathrm{LCE}_{P,T}(i, j) \ge \tau$, we explicitly compare $k < \tau$ characters until $j + k \equiv 0 \pmod{\tau}$. When this occurs we
can look up a suffix of P, which the suffix j + k of T follows at least as long as it follows the suffix i + k of
P. This allows us to reduce the remaining part of the $\mathrm{LCE}_{P,T}$ query to an $\mathrm{LCE}_P$ query between these two
suffixes of P as follows.

$\mathrm{LCE}_{P,T}(i, j) = k + \min\left( A\left[\frac{j+k}{\tau}\right],\ \mathrm{LCE}_P\left(i + k,\ B\left[\frac{j+k}{\tau}\right]\right) \right)$

We need to take the minimum of the two values, since, as shown by Example 3, it can happen that the LCE
value between the two suffixes of P is greater than that between suffix i + k of P and suffix j + k of T. In
total, we use $O(\tau + q(m))$ time to answer a query. This concludes the proof of Theorem 6.
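
The reduction can be sketched in Python as follows. The sketch makes several assumptions for the sake of a
short, checkable example: the arrays A and B are built by brute force (no efficient preprocessing is shown),
the LCE structure on P is replaced by a naive stand-in, ties in the choice of B[i] are broken towards the
smallest index, and a guard is added for queries that reach the end of either string. All names are
hypothetical; positions are 1-based and index 0 of the arrays is unused.

```python
def lce2_naive(p, t, i, j):
    """LCE_{P,T}(i, j) by direct comparison (1-based positions)."""
    k = 0
    while i + k <= len(p) and j + k <= len(t) and p[i + k - 1] == t[j + k - 1]:
        k += 1
    return k


def lce_p_naive(p, i, j):
    """Stand-in for the LCE data structure on P."""
    k = 0
    while max(i, j) + k <= len(p) and p[i + k - 1] == p[j + k - 1]:
        k += 1
    return k


def build_tables(p, t, tau):
    """A[i] = max_j LCE_{P,T}(j, i*tau) and B[i] = a maximising j, by brute force."""
    a, b = [0], [0]                              # slot 0 is padding for 1-based use
    for i in range(1, len(t) // tau + 1):
        best = max(range(1, len(p) + 1), key=lambda j: lce2_naive(p, t, j, i * tau))
        a.append(lce2_naive(p, t, best, i * tau))
        b.append(best)
    return a, b


def lce_two_strings(p, t, i, j, tau, a, b, lce_p=lce_p_naive):
    """LCE_{P,T}(i, j) = k + min(A[(j+k)/tau], LCE_P(i+k, B[(j+k)/tau]))."""
    k = 0
    while (j + k) % tau != 0:                    # fewer than tau explicit comparisons
        if i + k > len(p) or j + k > len(t) or p[i + k - 1] != t[j + k - 1]:
            return k
        k += 1
    if i + k > len(p) or j + k > len(t):         # one of the suffixes is exhausted
        return k
    idx = (j + k) // tau
    return k + min(a[idx], lce_p(p, i + k, b[idx]))


P = "dbcaabcabcaabcac"
T = "cacdeabaacaabcaabcdcae"
A, B = build_tables(P, T, 5)
print(A[1:], lce_two_strings(P, T, 2, 13, 5, A, B))
```

The minimum is exact because the suffix of P chosen by B[i] agrees with the suffix of T for at least as long
as any other suffix of P does.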

Example 3 Consider the query $\mathrm{LCE}_{P,T}(2, 13)$ on the string P from Example 2 and

    j:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
    T:  c  a  c  d  e  a  b  a  a  c  a  a  b  c  a  a  b  c  d  c  a  e
                    ^              ^              ^              ^

    i:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    P:  d  b  c  a  a  b  c  a  b  c  a  a  b  c  a  c

The marked positions in T indicate the positions divisible by 5. As shown below, we can use the arrays
A = [0,6,4,2] and B = [16,3,11,10].

    i    $i\tau$   A[i]   B[i]
    1     5      0     16
    2    10      6      3
    3    15      4     11
    4    20      2     10

To answer the query $\mathrm{LCE}_{P,T}(2, 13)$ we make k = 2 character comparisons and find that

$\mathrm{LCE}_{P,T}(2, 13) = 2 + \min\left( A\left[\frac{13+2}{5}\right],\ \mathrm{LCE}_P\left(2 + 2,\ B\left[\frac{13+2}{5}\right]\right) \right) = 2 + \min(4, 5) = 6$.




6 Lower Bound 

In this section we prove the lower bound for LCE data structures. 

Lemma 4 Any LCE data structure that uses $O(n/\tau)$ bits additional space for an input array of size n,
requires $\Omega(\tau)$ query time, for any $\tau$, where $1 \le \tau \le n$.

Range Minimum Queries (RMQ) Given an array A of integers, a range minimum query data structure
must support queries of the form:

• RMQ(l, r): Return the position of the minimum element in A[l, r].

Brodal et al. [5] proved any algorithm that uses $n/\tau$ bits additional space to solve the RMQ problem for
an input array of size n (in any dimension), requires $\Omega(\tau)$ query time, for any $\tau$, where $1 \le \tau \le n$, even in
a binary array A consisting only of 0s and 1s. Their proof is in the non-uniform cell probe model [26]. In
this model, computation is free, and time is counted as the number of cells accessed (probed) by the query
algorithm. The algorithm is allowed to be non-uniform, i.e. for different values of input parameter n, we
can have different algorithms.

To prove Lemma 4, we will show that any LCE data structure can be used to support range minimum
queries, using one LCE query and $O(1)$ space additional to the space of the LCE data structure.

Reduction Using any data structure supporting LCE queries we can support RMQ queries on a binary
array A as follows: In addition to the LCE data structure, store the indices i and j of the longest substring
A[i, j] of A consisting of only 1's. To answer a query RMQ(l, r) compute res = LCE(l, i). Let z = j − i + 1
denote the length of the longest substring of 1's. Compare res and z:

• If res < z and l + res ≤ r, return l + res.

• If res ≥ z and l + z ≤ r, return l + z.

• Otherwise return any position in [l, r].

To see that this correctly answers the RMQ query consider the two cases. If res < z then A[l, l + res − 1]
contains only 1's, since A[i, j] contains only 1's. Thus position l + res is the index of the first 0 in A[l, l + res].
It follows that if l + res ≤ r, then A[l + res] = 0 and is a minimum in A[l, r]. Otherwise, A[l, r] contains only
1's.

If res ≥ z then A[l, l + z − 1] contains only 1's and position A[l + z] = 0. There are two cases: Either
l + z ≤ r, in which case this position contains the first 0 in A[l, r]. Or l + z > r in which case A[l, r] contains
only 1's.
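
The reduction is short enough to state as code. In the following Python sketch the binary array is
represented as a string of the characters 0 and 1, the LCE structure is replaced by a naive stand-in, and the
stored pair is the start i and length z of the longest run of 1's (equivalent to storing i and j); an array
without any 1's is handled by taking z = 0. These representation choices and all names are assumptions made
for the example.

```python
def lce_naive(a, i, j):
    """1-based LCE on the array a; stands in for any LCE data structure."""
    k = 0
    while max(i, j) + k <= len(a) and a[i + k - 1] == a[j + k - 1]:
        k += 1
    return k


def longest_ones_run(a):
    """Return (i, z): 1-based start and length of the longest run of 1's (z = 0 if none)."""
    best_start, best_len, cur_len = 1, 0, 0
    for pos, bit in enumerate(a, start=1):
        cur_len = cur_len + 1 if bit == "1" else 0
        if cur_len > best_len:
            best_start, best_len = pos - cur_len + 1, cur_len
    return best_start, best_len


def rmq(a, l, r, run, lce=lce_naive):
    """Position of a minimum of a[l..r] (1-based, inclusive) using one LCE query."""
    i, z = run
    res = lce(a, l, i)
    if res < z and l + res <= r:
        return l + res               # first 0 after a run of res ones
    if res >= z and l + z <= r:
        return l + z                 # a longest run is always followed by a 0
    return l                         # a[l..r] contains only 1's


ARR = "1101110010111"
RUN = longest_ones_run(ARR)
print(RUN, rmq(ARR, 4, 8, RUN), rmq(ARR, 11, 13, RUN))
```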